Filtering of complex systems using overlapping tree networks 
We introduce a technique capable of filtering information out of complex systems by
mapping them to networks and extracting a subgraph with the strongest links. The idea is based
on the Minimum Spanning Tree, and it can be applied to sets of graphs whose links are different
sets of interactions among the system's elements, which are described as network nodes. It can also
be applied to correlation-based graphs, where the links are weighted and represent the correlation
strength between all pairs of nodes. We applied this method to the European scientific collaboration
network, which is composed of all the projects supported by the European Framework Program FP6,
and also to the correlation-based network of the 100 highest capitalized stocks traded in the NYSE.
In both cases we identified meaningful structures, such as a strongly interconnected community
of countries that play an important role in the collaboration network, and clusters of stocks belonging
to different sectors of economic activity, which give significant information about the investigated
systems.



I. INTRODUCTION 

The study of complex systems, a name frequently used for
systems having a large number of elements that interact in
(usually) nontrivial ways, has been greatly advanced in recent
years by the use of graph theory. One can map a complex system
to a complex network by representing the interacting elements
of the system with nodes and their interactions with links
between the nodes. Examples of complex systems that have
recently been investigated from this perspective include the
Internet, the World Wide Web, communication networks, food
webs [1], sexual contacts among individuals [3], economic
networks, the network of collaborations in EU-funded
projects [16, 17], etc.

Complex systems are not always static, meaning that they may
evolve dynamically over time. This evolution can provide a
wealth of information about the processes driving the system.
One way to study such systems is to record the time dependence
of some specific and well-defined property, and thus obtain a
set of time series that depict the time evolution of the entire
system. These time series can be transformed into a graph by
using a similarity measure based on the cross-correlation among
its elements, as has been done very effectively, for example,
in the study of equity portfolios, which has led to an
established method for the investigation of financial systems.
Therefore, we can use the cross-correlations between variables
to create a correlation-based network for any complex system
for which we have data available. On the other hand, if the
system is static, meaning that its interactions do not change
with time, but there are different kinds of interactions among
its elements, we can create a static graph for each different
interaction and study this set of graphs.

In this work we describe an approach that can be used for both
of the aforementioned types of analysis. It is based on the
Minimum Spanning Tree (MST) technique, but it allows the
creation of a more representative network of the system
(compared to a simple MST) that maintains information about its
dynamics and its temporal evolution. As an application, we use
this method to analyze the European scientific collaboration
network for the projects carried out with the support of the
Framework Program FP6, and the network of the 100 highest
capitalized stocks traded in the NYSE in the period 1995-2003.



II. THE METHOD 

Any complex system with interacting elements can be mapped to a
network with N nodes. The links connecting the nodes of the
network represent the interactions among the system elements,
and the strengths of these interactions are used as the weights
of the links between the nodes. The total number of links of a
network depends on the information that we have about the
interactions between its nodes. In the special case where we
know the interaction strength between all pairs of nodes, as we
do in correlation-based graphs, the network is fully connected
and has N(N - 1)/2 links. In such cases it is essential to use
filtering techniques to reduce the number of connections, so
that we can study properties of the network that are generally
hidden due to its complexity. The most drastic filtering of a
network is achieved by extracting the MST, a technique that has
been used extensively in the literature for the study of
financial correlation matrices, leading to the identification
of clusters of stocks that result in a meaningful taxonomy. The
MST was recently used to extract the mode of collaboration in
research projects funded by the European Commission Framework
Programs [13]. The MST is a graph with the same number of nodes
N as the original network but with only N - 1 edges, chosen so
that their total weight is minimum. It is constructed by
starting with the disconnected graph that contains all the
nodes of the network and then adding links in order of
increasing weight, as long as they do not form loops, until all
nodes are connected. Such a structure is a much simpler graph
than the fully connected network, but it still gives us
interesting information about the system.
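
The construction just described (adding links in increasing
weight order and skipping any link that would close a loop)
corresponds to Kruskal's algorithm. A minimal Python sketch
follows; the function name, the (weight, i, j) edge format, and
the toy weights are illustrative assumptions, not part of the
original work.

```python
def minimum_spanning_tree(n_nodes, weighted_edges):
    """Kruskal's algorithm: return the MST links of a connected graph.

    weighted_edges is an iterable of (weight, i, j) with 0 <= i, j < n_nodes.
    """
    parent = list(range(n_nodes))  # union-find forest, one set per node

    def find(x):
        # Follow parent pointers to the set representative, halving paths.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, i, j in sorted(weighted_edges):  # increasing weight order
        root_i, root_j = find(i), find(j)
        if root_i != root_j:          # the link does not close a loop
            parent[root_i] = root_j   # merge the two components
            tree.append((i, j, weight))
            if len(tree) == n_nodes - 1:
                break                 # all nodes are connected
    return tree


# Toy fully connected graph with N = 4 nodes and N(N - 1)/2 = 6 links.
toy_edges = [(1.0, 0, 1), (4.0, 0, 2), (3.0, 0, 3),
             (2.0, 1, 2), (5.0, 1, 3), (1.5, 2, 3)]
toy_mst = minimum_spanning_tree(4, toy_edges)
```

For this toy graph the tree keeps the links (0, 1), (2, 3), and
(1, 2), that is, N - 1 = 3 of the 6 available links.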

Depending on the structure of the system under investigation
and on the available data, we can create a series of
representative networks that contain information about the
evolution and the dynamics of the original system. For networks
that are stationary, we can construct different snapshots by
using as links the different interactions among the nodes. This
allows us to calculate an MST separately for each snapshot of
the system. On the other hand, for correlation-based graphs,
where the network is constructed using time series data, we
divide the time series into smaller segments (time windows) and
construct different network snapshots of the system at
different time periods. Let us assume that we have calculated a
total of M MSTs from the M different snapshots of the original
network that we can obtain, and let us name E_T the set of all
the links present in the T-th calculated MST. We can create the
graph G that has as links the union of the sets E_T for all the
M trees that we have calculated,

\[ G = \bigcup_{T=1}^{M} E_T \]

The number of links of the graph G lies in the interval
[N - 1, M(N - 1)], where N - 1 is the number of links of one
MST with N nodes and M(N - 1) is the total number of links of M
different MSTs. The graph G has N - 1 links if the investigated
system is so stable that we find the same MST for all M
snapshots, while it has M(N - 1) links if the system is so
unstable that we find completely different MSTs for all M
snapshots. Since G cannot have more links than the fully
connected graph, a highly unstable system can yield a fully
connected graph with N(N - 1)/2 links once M >= N/2.
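
As a small illustration of the union of link sets and of the
bounds on the number of links, the following Python sketch
merges three hypothetical spanning trees on four nodes. The
helper names and the toy trees are assumptions made for this
example; the upper bound is capped at N(N - 1)/2 because G
cannot contain more links than the fully connected graph.

```python
def union_of_trees(trees):
    """Return the set of undirected links present in at least one tree."""
    links = set()
    for tree in trees:
        for i, j in tree:
            links.add((min(i, j), max(i, j)))  # store each link as (low, high)
    return links


def link_count_bounds(n_nodes, n_trees):
    """Smallest and largest possible number of links in the merged graph."""
    lower = n_nodes - 1
    upper = min(n_trees * (n_nodes - 1), n_nodes * (n_nodes - 1) // 2)
    return lower, upper


# Three hypothetical spanning trees on N = 4 nodes (M = 3 snapshots).
tree_a = [(0, 1), (1, 2), (2, 3)]
tree_b = [(0, 1), (1, 2), (1, 3)]
tree_c = [(0, 2), (1, 2), (2, 3)]
G = union_of_trees([tree_a, tree_b, tree_c])
low, high = link_count_bounds(4, 3)
```

Here the merged graph has 5 links, between the lower bound of 3
(all trees identical) and the capped upper bound of 6.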

In the MSTs that we calculate, each link has a weight, which
represents the strength of the interaction between nodes i and
j. We use these weights of the individual MSTs to calculate the
new weights of the combined graph G. We set the weight of a
link between nodes i and j in graph G to the mean value of the
strength that the same link has over the entire set of the M
MSTs. Because of this construction method, we name the
resulting graph the "Overlapping Tree Network" (OTN). A
representative example of the above procedure is given in
Figure 1. In the following sections, we apply this technique to
the different MSTs that we obtain from the European
collaboration network of the FP6, and to the network of the 100
highest capitalized stocks traded in the NYSE from 1995 to
2003.

FIG. 1: An example of the creation of the Overlapping Tree
Network (OTN). Three trees, (a), (b), and (c), are combined to
form the Overlapping Tree Network (d).
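
A minimal Python sketch of the weighting step is given here. It
assumes that the weight of a link is averaged over the trees in
which the link actually appears, which is one possible reading
of the definition above; averaging over all M trees, with
absent links counted as zero, would be a different convention.
The function name and the toy weights are hypothetical.

```python
from collections import defaultdict


def overlapping_tree_network(weighted_trees):
    """Merge weighted trees into an OTN given as {(i, j): mean weight}.

    Each tree is a list of (i, j, weight). Assumption: a link's weight is
    averaged over the trees that contain it, not over all M trees.
    """
    observed = defaultdict(list)
    for tree in weighted_trees:
        for i, j, weight in tree:
            observed[(min(i, j), max(i, j))].append(weight)
    return {link: sum(ws) / len(ws) for link, ws in observed.items()}


# Three hypothetical weighted trees on four nodes.
trees = [
    [(0, 1, 0.2), (1, 2, 0.4), (2, 3, 0.6)],
    [(0, 1, 0.4), (1, 2, 0.4), (1, 3, 0.5)],
    [(0, 2, 0.3), (1, 2, 0.4), (2, 3, 0.2)],
]
otn = overlapping_tree_network(trees)
```

The link (1, 2) appears in all three trees and keeps its weight
of about 0.4, while the link (0, 1) appears in two trees and
receives the mean of 0.2 and 0.4.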



III. APPLICATION TO EUROPEAN 
COLLABORATION NETWORK 

Joint scientific research in Europe has been funded through
large programs called Framework Programs (FPs). In all FPs the
partners come from different countries, and thus international
collaborations are strongly encouraged. In the following we
construct the collaboration networks of all countries
participating in FP6, the most recently concluded FP, which ran
during the period 2002-2006. The dataset can be obtained from
CORDIS. The collaboration networks are constructed separately
for each of the 16 thematic areas, by considering each country
as a node and representing the collaborations among countries
as links between the nodes. This means that an edge connecting
two nodes i and j represents the presence of at least one
collaboration between institutions from country i and
institutions from country j.

To each edge we assign a weight w_ij that represents the total
number of collaborations between institutions of the two
countries in a specific thematic area. We transform this weight
into a distance measure d_ij = 1/w_ij, so that the smaller the
distance, the stronger the collaboration between the countries.
By construction d_ij lies in the interval (0, 1], and it takes
its maximum value when there is only one collaboration between
a pair of countries, w_ij = 1.
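
The weight-to-distance transformation can be sketched in a few
lines of Python. The country pairs and collaboration counts
below are invented for illustration and are not FP6 data; the
function name is likewise hypothetical.

```python
def collaboration_distances(collaboration_counts):
    """Map {(i, j): w_ij} to {(i, j): d_ij} with d_ij = 1 / w_ij."""
    distances = {}
    for pair, count in collaboration_counts.items():
        if count < 1:
            continue  # no collaboration means no link in the network
        distances[pair] = 1.0 / count
    return distances


# Hypothetical collaboration counts for one thematic area (not FP6 data).
counts = {("DE", "FR"): 40, ("DE", "UK"): 25, ("FR", "MT"): 1, ("UK", "MT"): 0}
d = collaboration_distances(counts)
```

Pairs with many collaborations end up close together, a single
collaboration gives the maximum distance of 1, and pairs with
no collaboration are simply absent from the network.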

The use of Spanning Trees as subnetworks that retain only the
most meaningful connections of the original network is an
approach that has enhanced our understanding of various complex
systems. Following this approach, in a previous work [17] we
used the MST to measure the role of a country in the
collaboration network that resulted from its participation in
FP6. We found that each MST of the FP6 collaboration network
has a star-like structure around some specific countries, for
the different thematic areas. These countries, which are found
to act as hubs (strongly connected nodes), are Germany, the
United Kingdom, France, and Italy. More specifically, we find
that Germany (DE) is the central hub for 62.5% of the thematic
areas, the United Kingdom (UK) for 25%, France (FR) for 6.25%,
and Italy (IT) for 6.25%.

FIG. 2: (color online) (a) Examples of Minimum Spanning Trees (MSTs) created from the collaboration network of the FP6
for different instruments. (b) The Overlapping Tree Network (OTN) obtained by merging all the MSTs for all the 16 separate
activities. (c) The OTN that shows only the collaboration activity between the EU25 countries for different instruments of the
FP6. Green nodes represent the EU15 countries, yellow nodes represent the 10 new EU member countries, and red nodes
represent the countries outside the EU. The size of each node is proportional to its degree k. Blue lines represent the links
between the EU15 countries, and red lines represent the links that connect the 10 new EU members to the network. The
thickness of each link is proportional to its weight.

Here, for each one of the 16 thematic collaboration networks we
calculated these Minimum Spanning Trees again, and we combined
them by applying the aforementioned technique to obtain the OTN
of the collaboration network. Examples of MSTs, where the
star-like structure becomes apparent, and the resulting OTN are
shown in Figure 2. The OTN is still a well-connected network,
but its connections represent only the stronger collaboration
links, as they are captured by the MST for every thematic area.
To get a more detailed view, we zoom in on the OTN by examining
only the collaborations among the 25 EU member countries. A
strongly interconnected community of the most frequent hubs of
the network is identified. This community is like a nucleus of
countries with very strong collaboration links, while the other
countries play only a satellite role around it.

This picture is much richer than that of the MST because it not
only highlights the importance of the hubs of the network
(information that we can also get from the MST [13]), but also
shows quantitatively the interconnectivity pattern between
them, in such a way that a fully connected community is formed
between DE, UK, FR, IT, and ES. This means that these five
countries not only play an important role in the network, but
are also connected among themselves by very strong connections,
as shown by the thickness of the links between them. This
valuable information cannot be extracted from the MST since, by
definition, the MST does not allow loops to be formed.



IV. APPLICATION TO CORRELATION BASED 
NETWORKS 

We now apply the OTN technique to the network of the 100 most
capitalized stocks traded in the NYSE, using daily returns in
the period 1995-2003. The basis for the creation of an equity
network from a portfolio of N stocks is the analysis of the
cross-correlation among the time series of returns for all
pairs of stocks. The correlation coefficient provides a
similarity measure that can be used as the weight of the link
between each pair of stocks. Thus, a correlation-based network
is a fully connected weighted graph, and the weights of the
graph are obtained from the correlation matrix of the system.
The extraction of the MST from such graphs gives a wealth of
useful information, but because the MST is a very drastic
filtering method, it cannot capture more structured entities,
such as communities of stocks connected by strong links. The
need for filtering techniques that create richer graphs than
the MST was first addressed by Tumminello et al. with the
creation of the Planar Maximally Filtered Graph (PMFG)
technique. The PMFG is a graph that is created by adding links
to a disconnected network in order of decreasing weight, with
the constraint that the resulting graph must be planar. In what
follows, we use the OTN technique and compare our results with
those of the PMFG.

In order to apply the OTN technique to financial
correlation-based networks that originate from time series
data, as explained in Section II, we divide the time series
into a set of smaller segments using a sliding time window of
length T. The length T of the time window is a fraction
q = T/T_0 of the original length of the return time series, for
which T_0 = 2262 days. The time step that we use to move the
time window is one day. For every time step we calculate the
correlation matrix, which contains N(N - 1)/2 distinct entries,
determined from N time series of length T. If T is not very
large compared to N, it can be shown using arguments from
Random Matrix Theory that the determination of the correlations
is noisy. As a result, in order to have a more reliable
determination of the correlation matrix, we make sure that
Q = T/N > 1.
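
The sliding-window step can be sketched in plain Python as
follows. The function names and the toy return series are
assumptions for illustration only (they are not NYSE data), and
the sketch stops at the correlation values: how correlations
are turned into link weights for the MST is not specified here
and is left out.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def sliding_correlations(returns, window, step=1):
    """Yield one {(i, j): correlation} dict per position of the time window.

    returns is a list of N return series of equal length.
    """
    n_series, total = len(returns), len(returns[0])
    if window / n_series <= 1:
        raise ValueError("need Q = T/N > 1 for a reliable correlation matrix")
    for start in range(0, total - window + 1, step):
        segment = [series[start:start + window] for series in returns]
        yield {(i, j): pearson(segment[i], segment[j])
               for i in range(n_series) for j in range(i + 1, n_series)}


# Hypothetical returns for N = 3 series with 6 observations each.
toy_returns = [
    [0.01, -0.02, 0.03, 0.00, 0.02, -0.01],
    [0.02, -0.04, 0.06, 0.00, 0.04, -0.02],   # exactly twice the first series
    [-0.01, 0.02, -0.03, 0.00, -0.02, 0.01],  # exact mirror of the first series
]
matrices = list(sliding_correlations(toy_returns, window=4))
```

With 6 observations, a window of 4, and a step of one, the
window fits in 3 positions, and each position yields
N(N - 1)/2 = 3 correlation values. A window of 3 would give
Q = 1 and is rejected.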

Using this procedure we calculate the OTNs for different time
window lengths. As an example, we show in Fig. 3 the OTN that
was calculated using a time window of length T = 1200 days.
From this figure we see that most of the stocks belonging to
the same sector of economic activity are, as expected,
clustered together. In addition, from the OTN of Fig. 3 we can
identify how communities of stocks belonging to the same sector
of economic activity are connected, and how strong the links
between the communities are, just by examining the thickness of
these links. Furthermore, it is now possible to identify
certain stocks that have a large number of connections with
stocks from different sectors. Such stocks include General
Electric (GE), which is the most connected stock of the
network, and American International Group (AIG), which is an
insurance and financial conglomerate, among others.

This shows that these companies have activities not only in
their main field of operations but also in totally different
fields. An example is GE, which is known for electrical
machinery (from airplane engines to light bulbs) and extends to
the financial sector with GE Capital. Our method is thus able
to reveal interdisciplinary activities that other methods
cannot capture effectively.

FIG. 3: (color online) OTN obtained from the network of the
100 most capitalized stocks traded in the NYSE in the period
1995-2003. For the calculation a sliding time window of 1200
days was used. The different colors represent stocks belonging
to different sectors of economic activity, according to the
Standard Industrial Classification (SIC) codes. The size of
each node is proportional to the degree k of the node. The k
values of the nodes lie in the interval k ∈ [1, 32]. The
thickness of each link is proportional to the weight of the
link.

A widely used measure that quantifies the tendency of the nodes
of a graph to cluster is the clustering coefficient C(q), which
is defined as follows. Assuming that a vertex i has n_i
neighbors, at most n_i(n_i - 1)/2 edges can exist between the
neighbors of vertex i. If we denote by C_i(q) the fraction of
these possible edges that actually exist for node i, then C(q)
is defined as the average of C_i(q) over all connected nodes of
the network. In Fig. 4(a) we plot the values of the clustering
coefficient obtained for OTNs constructed using various time
window lengths. As expected, the clustering coefficient is
C(q) = 0 for the MST, since there are no loops in a tree. In
the same plot the broken horizontal line shows, for comparison,
the value of the clustering coefficient obtained using the PMFG
technique. As we see from Fig. 4(a), the PMFG, constructed
using the same time series length that we used to calculate the
MST, always has a higher clustering coefficient than the OTN.
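
The clustering coefficient defined above can be computed with a
short Python sketch. Two conventions here are assumptions not
fixed by the text: nodes with a single neighbor contribute
C_i = 0, and isolated nodes are left out of the average. The
function name and the two toy graphs are hypothetical.

```python
def clustering_coefficient(n_nodes, links):
    """Average local clustering coefficient of an undirected graph.

    Assumption: nodes with fewer than two neighbors contribute C_i = 0,
    and isolated nodes are left out of the average.
    """
    neighbors = {node: set() for node in range(n_nodes)}
    for i, j in links:
        neighbors[i].add(j)
        neighbors[j].add(i)

    local = []
    for nbrs in neighbors.values():
        n_i = len(nbrs)
        if n_i == 0:
            continue                      # not a connected node
        if n_i < 2:
            local.append(0.0)             # no pair of neighbors exists
            continue
        possible = n_i * (n_i - 1) / 2    # links possible among neighbors
        nbr_list = sorted(nbrs)
        existing = sum(1 for a in range(n_i) for b in range(a + 1, n_i)
                       if nbr_list[b] in neighbors[nbr_list[a]])
        local.append(existing / possible)
    return sum(local) / len(local) if local else 0.0


star_tree = [(0, 1), (0, 2), (0, 3)]            # a tree: no loops
with_loop = [(0, 1), (0, 2), (0, 3), (1, 2)]    # one triangle 0-1-2
c_tree = clustering_coefficient(4, star_tree)
c_loop = clustering_coefficient(4, with_loop)
```

The star-shaped tree gives C = 0, as any tree must, while
adding the single link (1, 2) closes a triangle and raises the
coefficient to 7/12.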

The above result is expected, since the PMFG is constructed in
such a way that many small cliques are formed. However, not all
of these cliques are necessarily meaningful, in the sense that
a cluster of stocks belonging to the same sector is, since the
formation of cliques is forced by the construction algorithm.
On the other hand, clustering information is important, because
it plays a central role in understanding the hierarchical
structure of an equity network [9], and it could be effectively
applied in portfolio selection.

In what follows we introduce an empirical function that can be
used to measure how meaningful the clusters are that we obtain
from filtering methods applied to correlation-based networks,
if we assume that the best method will cluster all stocks of
the same sector together. Of course, as we discussed above,
there are stocks that are strongly connected to stocks of
different sectors of economic activity, but such stocks are
usually the exception, not the rule.

In order to quantify the structure of the network we introduce
a structure function g. This function takes values in the range
(0, 1], and it attains its maximum value g = 1 for a complete
graph in which all nodes belong to the same partition. The g
function is defined as,



(1) 



The quantities that enter Eq. (1) are the total number of links
of the network, the total number of nodes of the network N_n,
the number of different partitions s (in this case the number
of different economic sectors), the number N_j of nodes that
belong to partition j, the number of different clusters that
are created from nodes belonging to partition j, the number
N_j(N_j - 1)/2 of possible links between all the nodes of
partition j, the number m_cl of nodes that form cluster cl, and
the number l_cl of links that connect the nodes of cluster cl.
We should make clear that in order to use the structure
function g, the partitioning of the network must be known a
priori. Therefore this function does not give information about
the partitioning of a graph; it only compares the output of
different partitioning methods.

We applied this measure to the structural information obtained
using the OTN for different time windows, and we compared it
with the structural information obtained using the PMFG. The
results are shown in Fig. 4(b), where we see that for large
time window lengths the value of g for the OTN is lower than
that for the PMFG, but as the time window length decreases the
OTN eventually captures more structural properties than the
PMFG. This happens because the OTN is based on the MST, which
is a relatively stable structure. Thus, for large time window
lengths the OTN includes only a small fraction of the links of
the original fully connected graph, just a few more than the
links included in the MST. But as we decrease the time window
length, we capture more information about the evolution of
links that become strong only for some period of time. Such
links are included in the OTN and make it well suited for
studying the network at different time periods. If we compare
the links of the MST calculated using the full-length time
series with those of the OTNs calculated for all the time
window lengths shown in Fig. 4, we find that on average
98.7 ± 0.2% of the links included in the MST are also included
in the OTN. This means that almost all the connectivity
information that we extract from the MST is included in the
OTN, but the OTN allows us to examine even more details of the
system.
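
The comparison of common links amounts to a set intersection. A
minimal Python sketch follows, using an invented five-node tree
and link list rather than the stock data; the function name is
a hypothetical choice for this example.

```python
def fraction_of_links_retained(reference_tree, network_links):
    """Fraction of the reference tree's links that also appear in the network."""
    def normalise(links):
        # Treat links as undirected by storing them as (low, high).
        return {(min(i, j), max(i, j)) for i, j in links}

    tree_links = normalise(reference_tree)
    return len(tree_links & normalise(network_links)) / len(tree_links)


# Hypothetical full-period MST and OTN link list on five nodes.
full_mst = [(0, 1), (1, 2), (2, 3), (3, 4)]
otn_links = [(1, 0), (1, 2), (2, 3), (0, 2), (1, 3), (2, 4)]
retained = fraction_of_links_retained(full_mst, otn_links)
```

In this toy case three of the four tree links are found in the
network, so the retained fraction is 0.75.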



V. DISCUSSION 

We introduced a general method that extracts information stored
in complex networks by using only a subset of the network's
strongest links, resulting in the Overlapping Tree Network
(OTN). Our method is based upon the well-established technique
of extracting the Minimum Spanning Tree, but it allows the
filtered network to have loops, and therefore it retains more
of the original complexity than the MST. The added information
in the OTN is that, besides the clustering together of similar
nodes as in the MST, we now have the full picture of the
connectivity pattern between the hub nodes and the strength of
these connections (w_ij). Such strongly interconnected clusters
form easily distinguishable communities that can be used to
partition the network. Furthermore, the OTN includes links that
could be strong only for a certain time period or for some
specific set of interactions, and then become weak again. Such
dynamical transitions are not detected by the use of the MST
alone. As a consequence, the importance of some central nodes
of the network detected by the MST, such as the stock of GE, is
highlighted by the use of the OTN. This method was applied to
two different systems and gave interesting insights in both
cases.

FIG. 4: (a) Clustering coefficient C of the OTN obtained from
the network of the 100 most capitalized stocks traded in the
NYSE using different sliding time windows, versus the fraction
q of the time series length that falls inside the time window.
(b) Structure measure g of the OTN obtained from the same
network using different sliding time windows, versus the
fraction q of the time series length that falls inside the time
window. In both plots the horizontal line stands for the
respective value of the PMFG obtained from the same data set.
The top axis shows the values of Q, and the shaded area marks
the region Q < 1, where the correlation matrix of the system is
dominated by noise.

The first system was the collaboration network of countries
participating in at least one of the European-sponsored
research projects. With this approach we were able to identify
a fully connected community of the most frequent hubs of the
network. The strength of the internal links connecting the
nodes of this community is much higher than the average
strength of the links of the remaining network. Therefore this
structure could be described as a nucleus of countries with
very strong collaboration links. All other participating
countries were found to play a satellite role around this
nucleus. Such information cannot be obtained from the MST
alone, because with the MST approach we lose information about
the interconnectivity. Yet interconnectivity information is
very important both for policy makers and for scientists who
create consortia and apply for funding to the European
Commission.

The second system was the network of the 100 most capitalized
stocks traded in the NYSE. These stocks form a
correlation-based network, which is calculated using the time
series of the daily returns of these equities. The OTN was
extracted using different lengths of the sliding time window.
We verified the expected clustering of stocks according to the
sectors of economic activity to which they belong, and we were
also able to identify the stocks of General Electric (GE) and
American International Group (AIG) as the ones that have the
strongest links with stocks from different sectors in the set
of the 100 highest capitalized stocks. Since clustering is
important information, we introduced an empirical function that
quantifies the result, in terms of clustering information, of
different filtering techniques applied to correlation-based
networks. We applied this function to the outcome of the OTN
and of the PMFG methods. Comparison of these methods showed
that the PMFG gives better clustering information if the length
of the time window used for the OTN method is large. For
smaller time windows the OTN is able to capture more structural
properties, and it is more suitable for studying the network at
different time periods.